In this unit we discuss several (adaptive) MCMC strategies. We will implement these proposals for a logistic regression model applied to the “famous” Pima Indian dataset. The purpose of this unit is mainly to present the code of the implementations, not to benchmark the performance of the various MCMC algorithms; refer to the nice paper by Chopin & Ridgway (2017) for a more comprehensive discussion of this aspect.

The Pima Indian dataset

First, let us load the Pima.tr and Pima.te datasets, which are available in the MASS R package. As described in the documentation, this dataset includes:

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.

See also the documentation of the Pima.tr and Pima.te datasets for a more comprehensive description of the variables involved. We combine the training and test datasets into a single data.frame as follows:

Pima <- rbind(MASS::Pima.tr, MASS::Pima.te)

We are interested in modeling the probability of being diabetic (variable type) as a function of the covariates.

y <- as.numeric(Pima$type == "Yes") # Binary response
X <- model.matrix(type ~ . - 1, data = Pima) # Design matrix
X <- cbind(1, scale(X)) # Standardize the covariates and add the intercept

Let \(\textbf{y} = (y_1,\dots,y_n)^\intercal\) be the vector of observed binary responses (variable type) and let \(\textbf{X}\) be the corresponding design matrix, whose generic row is \(\textbf{x}_i = (x_{i1},\dots,x_{ip})^\intercal\), for \(i=1,\dots,n\). We employ a generalized linear model such that

\[ y_i \mid \pi_i \overset{\text{ind}}{\sim} \text{Bern}(\pi_i), \qquad \pi_i = g(\eta_i), \qquad \eta_i = \beta_1x_{i1} + \cdots + \beta_p x_{ip}, \] where \(g(\cdot)\) is the inverse link function: either the inverse logit transform (plogis function) or the cumulative distribution function of a standard normal (pnorm function). We wish to estimate the parameter vector \({\bf \beta} = (\beta_1,\dots,\beta_p)^\intercal\). To fix ideas, we are interested in the Bayesian counterparts of the following models for binary data:

# Probit model
fit_probit <- glm(type ~ X - 1, family = binomial(link = "probit"), data = Pima)
# Logit model
fit_logit <- glm(type ~ X - 1, family = binomial(link = "logit"), data = Pima)

The focus of this presentation is on the computational aspects, so we will employ a relatively vague prior centered at \(0\), namely \(\beta \sim N(0, 100 I_p)\), i.e. independent normal priors with standard deviation \(10\). The reader should be aware that much better alternatives exist, for example if one aims at specifying an “uninformative” prior.

First, we specify the log-likelihood and the log-posterior functions. We consider here the logistic case.

# Log-likelihood of a logistic regression model
loglik <- function(beta, y, X) {
  eta <- c(X %*% beta)
  sum(y * eta - log(1 + exp(eta)))
}
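
The term log(1 + exp(eta)) in loglik overflows when eta is large (exp(eta) is Inf for eta above roughly 710). This is not an issue for the standardized Pima data, but a numerically stable variant, sketched here as an optional alternative, uses the identity \(\log(1+e^\eta) = \max(\eta, 0) + \log(1 + e^{-|\eta|})\):

```r
# Numerically stable log-likelihood: log(1 + exp(eta)) is rewritten as
# max(eta, 0) + log1p(exp(-|eta|)), which never overflows
loglik_stable <- function(beta, y, X) {
  eta <- c(X %*% beta)
  sum(y * eta - (pmax(eta, 0) + log1p(exp(-abs(eta)))))
}
```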

# Log-posterior: log-likelihood plus the N(0, 10^2) prior on each coefficient
logpost <- function(beta, y, X) {
  loglik(beta, y, X) + sum(dnorm(beta, 0, 10, log = TRUE))
}
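
As a quick sanity check of the log-posterior, its mode found by a numerical optimizer should agree closely with the maximum likelihood estimate, because the \(N(0, 10^2)\) prior is nearly flat at the scale of the likelihood. The following standalone sketch (on simulated data, not the Pima dataset) illustrates this:

```r
# Posterior mode under the vague prior vs. glm fit on simulated logistic data;
# the two estimates should nearly coincide
set.seed(1)
n <- 500
Xs <- cbind(1, rnorm(n), rnorm(n))
ys <- rbinom(n, 1, plogis(Xs %*% c(-0.5, 1, -1)))
lp <- function(beta) {
  eta <- c(Xs %*% beta)
  sum(ys * eta - log(1 + exp(eta))) + sum(dnorm(beta, 0, 10, log = TRUE))
}
map <- optim(rep(0, 3), function(b) -lp(b), method = "BFGS")$par
mle <- coef(glm(ys ~ Xs - 1, family = binomial(link = "logit")))
```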

Metropolis-Hastings

We start with a random-walk Metropolis-Hastings sampler, whose proposal is a multivariate Gaussian centered at the current state, \({\bf \beta}^* \mid {\bf \beta} \sim N({\bf \beta}, S)\). Correlated Gaussian increments are generated from the eigendecomposition \(S = VDV^\intercal\) by multiplying standard normal draws by \(A_1 = D^{1/2}V^\intercal\).

# R represents the number of retained samples
# burn_in is the number of discarded burn-in samples
# S is the covariance matrix of the multivariate Gaussian proposal
RMH <- function(R, burn_in, y, X, S) {
  p <- ncol(X)
  out <- matrix(0, R, p) # Initialize an empty matrix to store the values
  beta <- rep(0, p) # Initial values

  # Eigen-decomposition
  eig <- eigen(S, symmetric = TRUE)
  A1 <- t(eig$vectors) * sqrt(eig$values)

  # Metropolis-Hastings loop
  for (r in 1:(burn_in + R)) {
    beta_new <- beta + c(matrix(rnorm(p), 1, p) %*% A1)
    alpha <- min(1, exp(logpost(beta_new, y, X) - logpost(beta, y, X)))
    if (runif(1) < alpha) {
      beta <- beta_new # Accept the value
    }
    # Store the values after the burn-in period
    if (r > burn_in) {
      out[r - burn_in, ] <- beta
    }
  }
  out
}
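
Inside RMH, the correlated Gaussian increment is generated as z %*% A1, where z is a row vector of standard normals and \(A_1 = D^{1/2}V^\intercal\) comes from the eigendecomposition \(S = VDV^\intercal\); indeed \(A_1^\intercal A_1 = VDV^\intercal = S\), so the rows of z %*% A1 have covariance \(S\). A standalone check of this construction:

```r
# A1 is a "square root" of S: crossprod(A1) recovers S exactly, and
# standard normal rows multiplied by A1 have empirical covariance close to S
set.seed(1)
S <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
eig <- eigen(S, symmetric = TRUE)
A1 <- t(eig$vectors) * sqrt(eig$values)  # equals D^(1/2) %*% t(V)
draws <- matrix(rnorm(1e5 * 2), 1e5, 2) %*% A1
```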

Naive implementation

library(coda)
R <- 30000
burn_in <- 30000
set.seed(123)

S <- diag(1e-3, ncol(X))

# Running the MCMC
fit_MCMC <- as.mcmc(RMH(R, burn_in, y, X, S)) # Convert the matrix into a "coda" object

# Diagnostic
effectiveSize(fit_MCMC) # Effective sample size
    var1     var2     var3     var4     var5     var6     var7     var8 
318.2368 207.2123 300.9986 328.1145 198.1866 215.9027 333.1073 174.9089 
R / effectiveSize(fit_MCMC) # Integrated autocorrelation time
     var1      var2      var3      var4      var5      var6      var7      var8 
 94.26943 144.77904  99.66825  91.43151 151.37246 138.95146  90.06107 171.51783 
1 - rejectionRate(fit_MCMC) # Acceptance rate
    var1     var2     var3     var4     var5     var6     var7     var8 
0.719124 0.719124 0.719124 0.719124 0.719124 0.719124 0.719124 0.719124 

# Traceplots of the first two coefficients
plot(fit_MCMC[, 1:2])

Laplace approximation

set.seed(123)

p <- ncol(X)
S <- 2.38^2 * vcov(fit_logit) / p

# Running the MCMC
fit_MCMC <- as.mcmc(RMH(R, burn_in, y, X, S)) # Convert the matrix into a "coda" object

# Diagnostic
effectiveSize(fit_MCMC) # Effective sample size
    var1     var2     var3     var4     var5     var6     var7     var8 
1193.711 1106.886 1224.056 1239.297 1184.106 1143.590 1219.163 1244.564 
R / effectiveSize(fit_MCMC) # Integrated autocorrelation time
    var1     var2     var3     var4     var5     var6     var7     var8 
25.13171 27.10306 24.50867 24.20728 25.33557 26.23318 24.60704 24.10482 
1 - rejectionRate(fit_MCMC) # Acceptance rate
     var1      var2      var3      var4      var5      var6      var7      var8 
0.2745758 0.2745758 0.2745758 0.2745758 0.2745758 0.2745758 0.2745758 0.2745758 

# Traceplots of the first two coefficients
plot(fit_MCMC[, 1:2])
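
Once a chain has been stored, posterior point estimates and credible intervals are plain sample summaries of the draws. A minimal sketch, using a simulated matrix in place of the actual fit_MCMC output:

```r
# Posterior means and 95% credible intervals from a matrix of draws,
# one column per coefficient (here a stand-in for fit_MCMC)
set.seed(1)
draws <- matrix(rnorm(1000 * 2, mean = c(0.5, -1)), 1000, 2, byrow = TRUE)
post_mean <- colMeans(draws)
post_ci <- apply(draws, 2, quantile, probs = c(0.025, 0.975))
rbind(mean = post_mean, post_ci)
```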

Adaptive Metropolis-Hastings

We consider here a version of the Adaptive Metropolis (AM) algorithm of Haario, Saksman, and Tamminen (2001). The algorithm proceeds as in the standard Metropolis algorithm, but the covariance matrix of the proposal is updated at each iteration using the past history of the chain.

We use the following (adaptive) Gaussian proposal distribution at iteration \(r\):

\[{\bf \beta}^* \mid {\bf \beta} \sim N\left({\bf \beta}, \; \frac{2.38^2}{p} \Sigma_r + \epsilon I_p\right),\]

where \(\Sigma_r\) is the covariance matrix of the first \(r\) sampled values \(\beta^{(1)},\dots,\beta^{(r)}\) and \(\epsilon > 0\) is some small value that avoids degeneracies (i.e. it keeps the proposal covariance matrix invertible). Here, we will use \(\epsilon = 10^{-6}\). Moreover, the following recursive formula holds, up to a remainder of order \(1/r^2\): \[ \Sigma_r = \frac{1}{r}\sum_{j=1}^r(\beta^{(j)} - \bar{\beta}^{(r)})(\beta^{(j)} - \bar{\beta}^{(r)})^\intercal \approx \frac{r - 1}{r}\Sigma_{r-1} + \frac{1}{r}(\beta^{(r)} - \bar{\beta}^{(r-1)})(\beta^{(r)} - \bar{\beta}^{(r-1)})^\intercal, \]

where \(\bar{\beta}^{(r)} = \frac{r - 1}{r} \bar{\beta}^{(r-1)} + \frac{1}{r}\beta^{(r)}\) is the arithmetic mean of the first \(r\) sampled values. This recursion facilitates the computation of \(\Sigma_r\), as it avoids recomputing the covariance from scratch at each iteration. The code for the AM algorithm is given in the following chunk.
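
The recursion can be checked numerically. The sketch below uses the exact online (Welford-type) update, in which one of the two factors involves the updated mean; the update used in the code that follows, with the previous mean in both factors and weight \(1/r\), agrees with it up to order \(1/r^2\):

```r
# Exact online covariance update: with mu the running mean,
# Sigma_r = ((r - 1) * Sigma_{r-1} + (x_r - mu_{r-1})(x_r - mu_r)^T) / r
# reproduces the batch covariance (with denominator r) exactly
set.seed(1)
x <- matrix(rnorm(200 * 3), 200, 3)
mu <- rep(0, 3)
Sigma <- matrix(0, 3, 3)
for (r in 1:200) {
  mu_new <- mu + (x[r, ] - mu) / r
  Sigma <- ((r - 1) * Sigma + tcrossprod(x[r, ] - mu, x[r, ] - mu_new)) / r
  mu <- mu_new
}
```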

# R represents the number of retained samples
# burn_in is the number of discarded burn-in samples
# The proposal covariance matrix is adapted at each iteration
RMH_Adaptive <- function(R, burn_in, y, X) {
  p <- ncol(X)
  out <- matrix(0, R, p) # Initialize an empty matrix to store the values
  vars <- matrix(0, R + burn_in, p) # Just for teaching purposes - store the variances
  beta <- rep(0, p) # Initial values
  epsilon <- 1e-6

  # Initial matrix S
  S <- diag(epsilon, p)
  Sigma_r <- diag(0, p)
  mu_r <- beta

  for (r in 1:(burn_in + R)) {

    # Update the covariance matrix (mu_r here is still the previous mean)
    Sigma_r <- (r - 1) / r * Sigma_r + tcrossprod(beta - mu_r) / r
    mu_r <- (r - 1) / r * mu_r + beta / r
    S <- 2.38^2 * Sigma_r / p + diag(epsilon, p)

    # Eigen-decomposition
    eig <- eigen(S, symmetric = TRUE)
    A1 <- t(eig$vectors) * sqrt(eig$values)

    beta_new <- beta + c(matrix(rnorm(p), 1, p) %*% A1)
    alpha <- min(1, exp(logpost(beta_new, y, X) - logpost(beta, y, X)))
    if (runif(1) < alpha) {
      beta <- beta_new # Accept the value
    }

    vars[r, ] <- diag(Sigma_r) # For teaching purposes, let us store the variances

    # Store the values after the burn-in period
    if (r > burn_in) {
      out[r - burn_in, ] <- beta
    }
  }
  list(beta = out, vars = vars)
}
set.seed(123)
# Running the MCMC
fit_MCMC <- RMH_Adaptive(R = R, burn_in = burn_in, y, X)
beta_MCMC <- as.mcmc(fit_MCMC$beta) # Convert the matrix into a "coda" object

# Diagnostic
effectiveSize(beta_MCMC) # Effective sample size
     var1      var2      var3      var4      var5      var6      var7      var8 
1284.6719 1036.3597 1184.7249 1039.2479  869.2933  985.3165 1008.0534  891.9833 
R / effectiveSize(beta_MCMC) # Integrated autocorrelation time
    var1     var2     var3     var4     var5     var6     var7     var8 
23.35227 28.94748 25.32233 28.86703 34.51079 30.44707 29.76033 33.63292 
1 - rejectionRate(beta_MCMC) # Acceptance rate
     var1      var2      var3      var4      var5      var6      var7      var8 
0.2057735 0.2057735 0.2057735 0.2057735 0.2057735 0.2057735 0.2057735 0.2057735 

# Traceplots of the first two coefficients
plot(beta_MCMC[, 1:2])


# Traceplots of the adaptive variances of the first two variables
par(mfrow = c(2, 1))
plot(fit_MCMC$vars[, 1], type = "l", main = "Variance of variable 1", ylab = "Variance")
plot(fit_MCMC$vars[, 2], type = "l", main = "Variance of variable 2", ylab = "Variance")

Metropolis within Gibbs

We now update one coordinate at a time: for each \(j = 1,\dots,p\), a univariate Gaussian random-walk proposal with standard deviation se[j] is accepted or rejected while the remaining components are held fixed at their current values.

RMH_Gibbs <- function(R, burn_in, y, X, se) {
  p <- ncol(X)
  out <- matrix(0, R, p) # Initialize an empty matrix to store the values
  beta <- beta_new <- rep(0, p) # Initial values

  for (r in 1:(burn_in + R)) {
    for (j in 1:p) {
      beta_new[j] <- beta[j] + rnorm(1, 0, se[j])
      alpha <- min(1, exp(logpost(beta_new, y, X) - logpost(beta, y, X)))
      if (runif(1) < alpha) {
        beta[j] <- beta_new[j] # Accept the value
      } else {
        beta_new[j] <- beta[j] # Reset the rejected component
      }
    }

    # Store the values after the burn-in period
    if (r > burn_in) {
      out[r - burn_in, ] <- beta
    }
  }
  out
}
p <- ncol(X)
se <- sqrt(rep(1e-4, p))

set.seed(123)
# Running the MCMC
fit_MCMC <- as.mcmc(RMH_Gibbs(R = R, burn_in = burn_in, y, X, se)) # Convert the matrix into a "coda" object

# Diagnostic
effectiveSize(fit_MCMC) # Effective sample size
    var1     var2     var3     var4     var5     var6     var7     var8 
45.95430 41.38699 40.84355 46.78707 37.52987 28.46185 48.15595 41.35603 
R / effectiveSize(fit_MCMC) # Integrated autocorrelation time
     var1      var2      var3      var4      var5      var6      var7      var8 
 652.8225  724.8655  734.5102  641.2028  799.3633 1054.0427  622.9759  725.4081 
1 - rejectionRate(fit_MCMC) # Acceptance rate
     var1      var2      var3      var4      var5      var6      var7      var8 
0.9543651 0.9538651 0.9536985 0.9538985 0.9526318 0.9530984 0.9556985 0.9540318 

# Traceplots of the first two coefficients
plot(fit_MCMC[, 1:2])

Adaptive Metropolis within Gibbs

We finally make the componentwise sampler adaptive, in the spirit of Roberts and Rosenthal (2009): after every batch of 50 iterations, the log standard deviation of each univariate proposal is increased or decreased by min(0.01, r^{-1/2}), depending on whether the acceptance rate over the batch is above or below the target value of 0.44.

RMH_Gibbs_Adaptive <- function(R, burn_in, y, X, target = 0.44) {
  p <- ncol(X)
  out <- matrix(0, R, p) # Initialize an empty matrix to store the values
  beta <- beta_new <- rep(0, p) # Initial values

  epsilon <- rep(0, p) # Log standard deviations of the univariate proposals
  accepted <- numeric(p) # Acceptance counters for the current batch
  batch <- 1

  for (r in 1:(burn_in + R)) {

    # Do we need to update the parameters?
    if (batch == 50) {
      for (j in 1:p) {
        # Adaptation: tune the log standard deviation of each proposal
        if ((accepted[j] / 50) > target) {
          epsilon[j] <- epsilon[j] + min(0.01, sqrt(1 / r))
        } else {
          epsilon[j] <- epsilon[j] - min(0.01, sqrt(1 / r))
        }
      }
      # Restart the cycle - Erase everything
      accepted <- numeric(p) # Vector of accepted values
      batch <- 0
    }

    # Increment the batch
    batch <- batch + 1

    for (j in 1:p) {
      beta_new[j] <- beta[j] + rnorm(1, 0, exp(epsilon[j]))
      alpha <- min(1, exp(logpost(beta_new, y, X) - logpost(beta, y, X)))
      if (runif(1) < alpha) {
        beta[j] <- beta_new[j] # Accept the value
        accepted[j] <- accepted[j] + 1
      } else {
        beta_new[j] <- beta[j] # Reset the rejected component
      }
    }

    # Store the values after the burn-in period
    if (r > burn_in) {
      out[r - burn_in, ] <- beta
    }
  }
  out
}
set.seed(123)
# Running the MCMC
fit_MCMC <- as.mcmc(RMH_Gibbs_Adaptive(R = R, burn_in = burn_in, y, X)) # Convert the matrix into a "coda" object

# Diagnostic
effectiveSize(fit_MCMC) # Effective sample size
     var1      var2      var3      var4      var5      var6      var7      var8 
 973.6199  599.0175  787.8033 1096.0456  671.1191  597.0799  906.0121  585.4803 
R / effectiveSize(fit_MCMC) # Integrated autocorrelation time
    var1     var2     var3     var4     var5     var6     var7     var8 
30.81285 50.08201 38.08057 27.37112 44.70145 50.24453 33.11214 51.23998 
1 - rejectionRate(fit_MCMC) # Acceptance rate
     var1      var2      var3      var4      var5      var6      var7      var8 
0.4472149 0.4471816 0.4511817 0.4503150 0.4485483 0.4493483 0.4490150 0.4490816 

# Traceplots of the first two coefficients
plot(fit_MCMC[, 1:2])